Online Backpropagation Learning for a Human-following Mobile Robot

Authors

  • Yang Wang
  • David Lee
Abstract

This paper investigates an on-line backpropagation learning system for a mobile robot that learns to follow a human. The scenario is part of a project to investigate Human-Robot Interaction within a social context. Because the environment is entirely unknown to the system, training data have to be generated during operation, for which a training data selection method is proposed. Two types of learning take place simultaneously in the system: adaptive learning, which learns slowly, and reactive learning, which learns fast. These satisfy the system's requirements for long-term adaptation and short-term reactivity respectively. The learning happens on-line and can adapt rapidly to the unknown environment.

1 Context of Research

This paper presents research that is part of a project investigating autonomous learning of appropriate social distance by a mobile robot. The project aims to develop a robot that fulfils the principles of Human-Robot Interaction (HRI) in a social context, with an emphasis on the maintenance of social space. Hall (1966) states that social space is an important element of any form of social organization and that the control of distance between agents is an important means of communication. One scenario proposed in the project is human-following by a mobile robot; an on-line learning method for this scenario is discussed in this paper.

HRI has become a very active research topic. Adaptive systems, especially autonomous learning systems, have been widely used because of the unpredictable dynamics of human-related interactions. One approach adopted by many researchers is to analyse human behaviour from experiments in which people interact with each other or with a robot. From the collected data a mathematical model can be built and an adaptive system constructed so that the robot satisfies the interaction with the human (e.g. Wood, 2006). Another commonly used method is to build the robot to interact in a creature-like way and to base its behaviour on existing, well-studied physical and biological models (e.g. Arkin, 2003), leaving some key parameters of the model to be adjusted by an adaptive system. Both methods have introduced social learning concepts and produced productive results, and the control of social space has been studied in both (e.g. Mitsunaga et al., 2005; Nakauchi, 2002).

However, we take a less studied perspective. Social behaviours are outcomes of social interaction, the foundation of which is responding to the attitude of the other (Ashworth, 1979): people behave based on their perception of others. An attractive idea is therefore to model the robot system so that it learns the behaviours of interaction according to the attitude of the human, without constraining the model to any specific scenario. Since the human's attitude is associated with a particular interaction, the robot will learn to interact in an appropriate way; in this sense, the attitude of the human is a general reflection of their satisfaction with the interaction in which they are engaged. The human-following scenario is the test bed of the simulation, and the attitude of the human is designed to be associated with the human-following behaviour. The sensory system of the robot must measure its position relative to the person, who in turn is able to input feedback reflecting his or her level of satisfaction, i.e. the attitude, with the robot's current position.
This sensor will contain a digital input device that the human holds and uses to provide feedback at any time; similar devices for use in HRI have been studied in previous research (e.g. Koay et al., 2005). It will be connected to the robot mechanically so that the position of the robot relative to the person can be measured. This sensory system is simulated in the work reported here. The primary obstacle in this system is that the attitude of the human is not, and cannot be directly transformed into, an error measurement for the robot system. The study focuses on autonomous learning under this constraint and leaves other dynamics to further research within the project.

2 Overview of the Algorithm

Artificial neural networks (ANNs) form one of the most widely used autonomous learning methods, and error backpropagation (BP) learning has been well studied in the literature (e.g. Stroeve, 1998). As noted above, in the proposed scenario there are no direct error measurements, only an arbitrary performance reward score given by the human to denote their satisfaction with the interaction. Such a problem appears to fit the category of a policy-making algorithm and has been studied with reinforcement learning algorithms (Schaal, 1997). However, compared to previous studies, a further problem we face is that the reward score has limited gradients and is discrete, while our scenario needs good generalisation, which is hard for a conventional reinforcement network. Multi-layered feed-forward (MLFF) ANNs with BP learning therefore became our focus.

A training data selection method is used to generate and optimise training data during simulation, so that the reward score given to the robot's performance can control the learning of the system. On-line BP learning needs to review old data while it learns new patterns (Patterson, 1996). Adaptive learning is introduced to act as a reviewer in the system: a long-term learning process that gradually optimises and generalises the system performance through training. It is accompanied by a fast reactive learning procedure, which enables the robot to respond quickly to the attitude of the human by minimising the error on the most recently collected data. The system consists of two small MLFF networks, together with a set of matrices for selecting training data. The performance of the system has been tested in simulation and appears to be robust.

The remainder of this paper is organised as follows. Section 3 specifies the model of the people-following problem. Section 4 explains the structure and the learning algorithm of the system. Section 5 discusses the training procedures and presents the simulation results, and Section 6 evaluates the capabilities of the system. Finally, there is a short conclusion.

3 Model Specification

A socially acceptable human-following robot must maintain the distance and position of which the target human most approves. Our focus in this study is to adapt the robot's dynamics to an unknown human preference about being followed. In the setting of the sensory system, the human in the simulation is simplified to a reward function of the robot's position relative to the human. This function provides a single numerical score indicating the human's satisfaction; it represents a surface relating position to reward score, whose peak corresponds to the most appropriate location. The robot is initialised knowing neither the surface of the reward nor the consequences of its movements.
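As a concrete illustration (not part of the original paper), the sketch below builds one such reward function under assumed parameters: a single peak at the person's preferred following position, decaying to a constant floor outside a limited range, in the spirit of Fig. 1 below. The peak location, width and floor value are hypothetical.

```python
import numpy as np

def make_reward_surface(peak_xy=(-400.0, -600.0), sigma=500.0, floor=0.05):
    """Build a hypothetical reward surface R(p) with scores in [0, 1].

    peak_xy : assumed preferred position of the robot relative to the person (mm).
    sigma   : assumed width of the graded region (mm).
    floor   : score of the 'flat surface' outside the graded region.
    """
    peak = np.asarray(peak_xy, dtype=float)

    def reward(p):
        p = np.asarray(p, dtype=float)
        # Single-peaked score that decays with distance from the preferred position ...
        score = float(np.exp(-np.sum((p - peak) ** 2) / (2.0 * sigma ** 2)))
        # ... clipped to a constant floor, since the human cannot give smooth
        # feedback over the whole space.
        return score if score > floor else floor

    return reward

R = make_reward_surface()
print(R((-400.0, -600.0)))   # ~1.0 at the preferred following position
print(R((1200.0, 1200.0)))   # flat-surface floor far from the peak
```

In the simulation the learning system only ever observes the scores such a function returns, never its parameters, matching the assumption that the surface is unknown to the robot.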
Since the reward surface is unknown, it is essential that the robot learns appropriate actions to reach the best position to take relative to the human's preference, i.e. the reward surface of the target human. An assumption is made that the human always maintains a constant velocity, which the robot acquires instantly as soon as it starts to follow; this allows the human to be treated as stationary by subtracting the constant velocity from the system. Another simplification is that all collisions are ignored (e.g. between the robot and the human). An example reward surface is shown in Fig. 1.

Fig. 1. Contour plot of a reward surface: the contour lines illustrate the reward score that the human provides as a function of the robot's position, with the score values marked on the lines. The 'flat surface' marks the space without gradient.

Fig. 1 is a map of 3000 mm by 3000 mm. The target person shows a preference in both angle and distance to the follower; in this example the person would like to be followed from behind on the left. The reward score lies in the interval [0, 1]: the higher the score, the higher the human's satisfaction. The gradient of the reward exists only within a limited range; the area outside this range keeps the lowest level of satisfaction and is marked as the 'flat surface', which occupies most of the map. The surface is designed in this way because it would be unrealistic for a human to give smooth, continuous feedback over the whole space. We assume that the person has an (initially unknown) reward surface and that the robot's objective is to find a sequence of movements that takes it to the position with the highest reward value.

The system works in discrete time steps, where the interval between any two adjacent time indices is taken to be one second. A further assumption is that the robot accomplishes any assigned movement instantly. The input to the system is the position of the robot and the outputs are the movements along the x and y axes in the next second, both in the range [-50, 50] mm. These assumptions were made only to simplify the simulation and can be relaxed easily without major changes to the system configuration.

4 MLFF Network with BP Learning

The system contains two MLFF networks, one for each output, i.e. the movements in the x and y directions respectively. Separating the outputs into independent networks reduces the complexity of the system, because the mapping dimension of each network is lower than that of an integrated one. The BP rule minimises output error, and training data are the standard means of measuring such error; this model, however, provides no prior training data and no on-line information that can be used for training directly. A method of composing training data is therefore required.

4.1 Training Data Selection

MLFF BP learning requires an error value to be associated with each output. Such error cannot be measured directly, as the targets of the training data are unknown; the feedback to the system is a single value denoting how the human feels about the current position of the robot. As a prerequisite for performing BP learning in this system, a simple method is introduced so that training data can be generated, optimised and updated from the information collected during operation, which consists of the history of positions and the corresponding reward scores.
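Before turning to the selection method itself, the two per-axis MLFF networks introduced at the start of this section can be sketched as follows. This is only an illustration: the layer sizes, activation functions, input scaling and tanh-based output bounding to [-50, 50] mm are our assumptions rather than values given in the paper.

```python
import numpy as np

class SmallMLFF:
    """One of the two per-axis networks: position (x, y) -> one movement component."""

    def __init__(self, n_hidden=6, move_limit=50.0, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights; 2 inputs (x, y), one hidden layer, 1 output.
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, 2))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.b2 = np.zeros(1)
        self.move_limit = move_limit

    def forward(self, p):
        """Return a movement along this network's axis, bounded to [-50, 50] mm."""
        p = np.asarray(p, dtype=float) / 1500.0   # assumed input scaling: mm -> roughly [-1, 1]
        h = np.tanh(self.W1 @ p + self.b1)        # hidden layer
        y = np.tanh(self.W2 @ h + self.b2)        # output in [-1, 1]
        return float(self.move_limit * y[0])

# One network per output axis, as described at the start of Section 4.
net_x, net_y = SmallMLFF(seed=0), SmallMLFF(seed=1)
position = np.array([200.0, -300.0])              # robot position relative to the person (mm)
move = np.array([net_x.forward(position), net_y.forward(position)])
print(move)                                        # next-second movement in x and y (mm)
```

The input-target pairs on which such networks would be trained are produced by the selection method described next.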
Suppose O(·) is the system output function, so that at any time k, for the position input pattern p_k, the system gives the movement output O(p_k). Thus:

p_{k+1} = p_k + O(p_k).   (1)

Let R(p) denote the reward at position p. The change of the reward score, l_k, caused by the movement O(p_k) at position p_k is then:

l_k = R(p_{k+1}) - R(p_k).   (2)

This is all the information available during operation. l_k > 0 means that the current move O(p_k) is triggering a positive change in the attitude of the human. Although there is no reason to take it as the best move, if p_{k+1} is the best-rewarded place that the robot has reached so far by moving from position p_k, it is sensible to train the system with the input pattern p_k and desired output O(p_k). Because the learning of the system optimises the moves, better movements may still occur at p_k; such movements will be collected and will replace the old training data. In this way, data selection and system learning enhance each other over time. At each position, the system output that has received the greatest increase in reward up to the current time is taken as the target for training with that input pattern.

To store the training data, two matrices Mr and M are introduced. Both cover the 60 × 60 possible input patterns obtained by dividing the 3000 mm × 3000 mm map into equal intervals: Mr(n) is the greatest reward obtained by moving from position n, and M(n) is the movement that earned it. A trainable input-target data pair is thus formed as [n, M(n)]. Mr is used to validate whether new feedback from the human can be used to update the training data. The primary update happens when l_k > 0 and R(p_{k+1}) > Mr(p_k):

Mr(p_k) = R(p_{k+1}),   M(p_k) = O(p_k).
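A minimal sketch of this selection step is given below, assuming the map is centred on the person and using our own variable names and an assumed initialisation of Mr. An entry is overwritten only when a move both increased the reward (l_k > 0) and exceeded the best reward previously reached from that cell.

```python
import numpy as np

GRID = 60                               # 60 x 60 input patterns
CELL = 3000.0 / GRID                    # 3000 mm map divided into equal intervals

Mr = np.full((GRID, GRID), -np.inf)     # best reward reached by a move from each cell (initialisation assumed)
M = np.zeros((GRID, GRID, 2))           # movement that earned that reward, (x, y) in mm

def cell_index(p):
    """Map a position in [-1500, 1500) mm (map assumed centred on the person) to a grid cell."""
    i = int((p[0] + 1500.0) // CELL)
    j = int((p[1] + 1500.0) // CELL)
    return min(max(i, 0), GRID - 1), min(max(j, 0), GRID - 1)

def update_training_data(p_k, move, r_k, r_next):
    """Store (p_k, move) as a training pair if it beat the best reward reached from that cell."""
    l_k = r_next - r_k                  # change of reward score, Eq. (2)
    i, j = cell_index(p_k)
    if l_k > 0 and r_next > Mr[i, j]:   # primary update condition
        Mr[i, j] = r_next               # Mr(p_k) <- R(p_{k+1})
        M[i, j] = move                  # M(p_k) <- O(p_k)
        return True                     # new input-target pair [p_k, move] is available
    return False
```

Each stored pair [n, M(n)] then serves as a training example for BP learning in the two networks.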


